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Chapter 1 

Graphical Models 



1.1 Introduction 

Probabilistic graphical models combine the graph theory and probability theory 
to give a multivariate statistical modeling. They provide a unified description 
of uncertainty using probability and complexity using the graphical model. Es- 
pecially, graphical models provide the following several useful properties: 

• Graphical models provide a simple and intuitive interpretation of the 
structures of probabilistic models. On the other hand, they can be used 
to design and motivate new models. 

• Graphical models provide additional insights into the properties of the 
model, including the conditional independence properties. 

• Complex computations which are required to perform inference and learn- 
ing in sophisticated models can be expressed in terms of graphical manip- 
ulations, in which the underlying mathematical expressions are carried 
along implicitly. 

The graphical models have been applied to a large number of fields, includ- 
ing bioinformatics, social science, control theory, image processing, marketing 
analysis, among others. However, structure learning for graphical models re- 
mains an open challenge, since one must cope with a combinatorial search over 
the space of all possible structures. 

In this paper, we present a comprehensive survey of the existing structure 
learning algorithms. 

1.2 Preliminaries 

We will first define a set of notations which will be used throughout this paper. 
We represent a graph as G = (V, E) where V = {vi} is the set of nodes in the 
graph and each node corresponds to a random variable a;, € X . E = {(v i} Vj) : 
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i ^ j} is the set of edges. In a directed graph, if there is an edge E i: j from Vi to 



i 

Vj, then Vi is a parent of node Vj and Vj is a child of node Vi. If there is no cycle 
in a directed graph, we call it a Directed Acyclic Graph (DAG). The number of 
nodes and number of edges in a graph are denoted by jV| and \E\ respectively. 
ir(i) is used to represent all the parents of node Vi in a graph. U = {x±, • • • , x n } 
denotes the finite set of discrete random variables where each variable Xi may 
take on values from a finite domain. Val(xi) denotes the set of values that 
variable xi may attain, and \xi\ = \Val{xi)\ denotes the cardinality of this set. 
In probabilistic graphical network, the Markov blanket dvi [Pearl, 1988] of a 
node Vi is defined to be the set of nodes in which each has an edge to Uj, i.e., all 
Vj such that (Vi,Vj) € E. The Markov assumption states that in a probabilistic 
graphical network, every set of nodes in the network is conditionally independent 
of Vi when conditioned on its Markov blanket dvi. Formally, for distinct nodes 
Vi and Vk, 

P(v i \dv i nvk)=P{v i \dv i ) 

The Markov blanket of a node gives a localized probabilistic interpretation of 
the node since it identifies all the variables that shield off the node from the 
rest of the network, which means that the Markov blanket of a node is the 
only information necessary to predict the behavior of that node. A DAG G is 
an I-Map of a distribution P if all the Markov assumptions implied by G are 
satisfied by P. 

Theorem 1.2.1. (Factorization Theorem) If G is an I-Map of P , then 
P(x x , ■■■ >x n ) = P{x i \x lr(i) ) 

i 

According to this theorem, we can represent P in a compact way when G 
is sparse such that the number of parameter needed is linear in the number of 
variables. This theorem is true in the reverse direction. 

The set X is d-separated from set Y given set Z if all paths from a node in 
A to a node in Y are blocked given Z. 

The graphical models can essentially be divided into two groups: directed 
graphical models and undirected graphical models. 



(XX) 



<xx& 



Figure 1.1: An Ising model with 9 nodes. 
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1.3 Undirected Graphical Models 



1.3.1 Markov Random Field 

A Markov Random Field (MRF) is denned as a pair M = (G, $). Here G = 
(V, E) represents an undirected graph, where V — {Vi} is the set of nodes, 
each of which corresponds to a random variable in X ; E = {(Vi, Vj) : i ^ j} 
represents the set of undirected edges. The existence of an edge {u, v} indicates 
the dependency of the random variable u and v. $ is a set of potential functions 
(also called factors or clique potentials) associated with the maximal cliques in 
the graph G. Each potential function 4> c (-) has the domain of some clique c 
in G, and is a mapping from possible joint assignments (to the elements of c) 
to non-negative real values. A maximal clique of a graph is a fully connected 
sub-graph that can not be further extended. We use C to represent the set 
of maximal cliques in the graph. (p c is the potential function for a maximal 
clique c G C. The joint probability of a configuration x of the variables V can 
be calculated as the normalized product of the potential function over all the 
maximal cliques in G: 



P(x) 



Ucec M x c) 
Ex' ILee M*c) 



where x c represents the current configuration of variables in the maximal clique 
c, x' c represents any possible configuration of variable in the maximal clique c. 
In practice, a Markov network is often conveniently expressed as a log-linear 
model, given by 

P ( x )_ ex P(EceC w c0c(x c )) 



In the above equation, <p c are feature functions from some subset of X to real 
values, w c are weights which arc to be determined from training samples. A log- 
linear model can provide more compact representations for any distributions, 
especially when the variables have large domains. This representation is also 
convenient in analysis because its negative log likelihood is convex. However, 
evaluating the likelihood or gradient of the likelihood of a model requires in- 
ference in the model, which is generally computationally intractable due to the 
difficulty in calculating the partitioning function. 

The Ising model is a special case of Markov Random Field. It comes from 
statistical physics, where each node represents the spin of a particle. In an 
Ising model, the graph is a grid, so each edge is a clique. Each node in the 
Ising model takes binary values {0, 1}. The parameters are 6>j representing the 
external field on particle i, and % representing the attraction between particles 



5 



i and j. 9ij = if i and j are not adjacent. The probability distribution is: 



where Z{9) is the partition function. 

1.3.2 Gaussian Graphical Model 

A Gaussian Graphical Model (GGM) models the Gaussian property of multi- 
variate in an undirected graphical topology. Assuming that there are n variables 
and all variables are normalized so that each of them follows a standard Gaus- 
sian distribution. We use X = (xi,--- , x„) to represent the n x 1 column 
matrix. In a GGM, the variables X are assumed to follow a multivariate Gaus- 
sian distribution with covariance matrix S, 



In a Gaussian Graphical Model, the existence of an edge between two nodes 
indicates that these two nodes are not conditionally independent given other 
nodes. Matrix f2 = is called the precision matrix. The non-zeros elements 
in the precision matrix correspond to the edges in the Gaussian Graphical 



The most commonly used directed probabilistic graphical model is Bayesian 
Network [Pearl, 1988], which is a compact graphical representation of joint 
distributions. A Bayesian Network exploits the underlying conditional inde- 
pendencies in the domain, and compactly represent a joint distribution over 
variables by taking advantages of the local conditional independence structures. 
A Bayesian network B = (G, P) is made of two components: a directed acyclic 
graph (DAG) G whose nodes correspond to the random variables, and a set 
of conditional probabilistic distributions (CPD), P(xi\x 7r ^ i - ) ), which describe the 
statistical relationship between each node i and its parents ir(i). In a CPD, for 
any specific configuration of x^w, the sum over all possible values of Xi is 1, 





Model. 



1.4 Directed Graphical Models 



P ( x i\**(i)) = 1- 



xi£Val(xi) 



In the continuous case. 




G 



where P{xi\x. 7T ^) is the conditional density function. The conditional in- 
dependence assumptions together with the CPDs uniquely determine a joint 
probability distribution via the chain rule: 



n 



) = n p (^i x ^)) 



i=l 



p(/^yes) = 0..00001 



p(a=<30) = 0.25 
/>(a=30-50) = 0.40 



p(s=male) = 0.5 




p(g=yeslf=yes) = 0.2 
p{g=yeslf=no) = 0.01 



p(j=yeslf=yes,a=*,s=*) = 0.05 



p(j=yeslf=no,a=<30,s=male) = 0..0001 
p(j=yeslf=no,a=30-50,s=male) = 0.0004 
p(j=yeslf=no,a=>50,s=male) = 0.0002 
p(j=yes If=no,a=<i0,s-female) = 0..0005 
p(j=yeslf=no,a=30-5Q,s=female) = 0.002 
p(j=yes lf=iK>,a=>50,s=female) = 0.001 



Figure 1.2: A Bayesian network for detecting credit-card fraud. Arcs indicate 
the causal relationship. The local conditional probability distributions associ- 
ated with a node are shown next to the node. The asterisk indicates any value 
for that variable. Figure excerpted from [Heckerman et al., 1995]. 

1.4.1 Conditional Probability Distribution 

The CPDs may be represented in different ways. The choice of the represen- 
tation is critical because it specifies the intrinsic nature of the conditional de- 
pendencies as well as the number of parameters needed for this representation. 
Here we describe some different types of CPDs. 

Table CPDs 

In the discrete case, the CPDs can be simply represented as a table in which 
each row corresponds to a specific configuration of a node and its parents, as well 
as the corresponding conditional probability of this configuration [Heckerman 
et al., 1995]. The table CPDs are advantageous in that they are simple and 
clear, but the size of the table CPDs will grow exponentially with the increase 
in the number of parents and the number of values that each node can take. 
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Tree CPDs 

The tree CPDs try to exploit the context specific information (CSI), i.e., the 
distribution over the values of a node does not depend on the value of some 
subset of its parents given the value of the other parents [Boutilicr ct al., 1996]. 
In a tree CPD, each interior vertex represents the splits on the value of some 
parent vertices, and each leaf represents a probability conditioned on a specific 
configuration along the path originated from the root. The tree CPDs usually 
require a substantially smaller number of parameters than table CPDs when 
CSI holds in many places of the Bayesian network. 

Softmax CPDs 

The softmax CPDs approximates the dependency of a discrete variable xi on its 
parents by a linear threshold function Segal [2004]. In this case, the value 
of each node is determined based on the sum of the contributions of the values 
of all its parents, i.e., the effect of ir(i) on node i taking on a value Xi can be 
summarized via a linear function: 



In the above equation, each weight w Xi j represents the contribution of the 
jth parent to the value of the target node i. Given the contribution function f,a 
common choice of how the probability of Xi depends on f Xi (x^^j) is the softmax 
distribution, which is the standard extension of the binary logistic conditional 
distribution to the multi-class case: 



Gaussian CPDs 

In many cases the random variables are continuous with associated density 
functions. A common choice of the density function is Gaussian distribution or 
Normal distribution t <~ N(n,a 2 ): 



The Gaussian CPDs are often incorporated in the table or tree CPDs, in 
which the parameters mu and a of the Gaussian distribution are determined by 
the configuration of node is parents. 




P(xi\x n{l) ) = 



exp(/ x .(x x(i) )) 
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Sigmoid CPDs 



The Sigmoid Belief Networks (SBN) [Neal, 1992, Titov and Henderson, 2007] 
has the CPD in the form: 



where er(-) denotes the logistic sigmoid function, and Jj, is the weight from j to 
i. 

Probability Formulas CPDs 

In a Relational Bayesian Network (RBN) [Jaeger, 1997, 2001], all variables take 
binary values. Each root node i has probability 9i £ [0, 1] to be 1. For each 
non-root node, the probability to taking value 1 is a combination function of the 
values of all its parents. A commonly used combination function is the noisy-or 
function which is defined as noisy-or(J) = 1 — 7r pe /(l — p) where / is a multiset 
of probabilities. 



• Dependency Networks: In [Hcckcrman et al., 2000], the authors proposed a 
probabilistic graphical model named Dependency Networks, which can be 
considered as combination of Bayesian network and Markov network. The 
graph of a dependency network, unlike a Bayesian network, can be cyclic. 
The probability component of a dependency network, like a Bayesian net- 
work, is a set of conditional distributions, one for each node given its 
parents. 

A dependency network is a pair (G, P) where G is a cyclic directed graph 
and P is a set of conditional probability distributions. The parents of 
nodes ir(i) of node i correspond to those variables that satisfy 



In other words, a dependency network is simply a collection of conditional 
distributions that arc defined and built separately. In a specific context 
of sparse normal models, these would define a set of separate conditional 
linear regressions in which xi is regressed to a small selected subset of 
other variables, each being determined separately. 

The independencies in a dependency network are the same as those of 
a Markov network with the same adjacencies. The authors proved that 
the Gibbs sampler applied to the dependency network will yield a joint 
distribution for the domain. The applications of dependency network in- 
clude probabilistic inference, collaborative filtering and the visualization 
of causal predictive relationships. 




1.5 Other Graphical Models 
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• Module Networks: In [Segal et al., 2003], the authors proposed a mod- 
ule networks model for gene regulatory network construction. The basic 
structure is a Bayesian network. Each regulatory module is a set of genes 
that are regulated in concert by a shared regulation program that governs 
their behavior. A regulation program specifies the behavior of the genes 
in the module as a function of the expression level of a small set of regula- 
tors. By employing the Bayesian structure learning to the modules instead 
of genes, this algorithm is able to reduce the computational complexity 
significantly. 

In [Toh and Horimoto, 2002] the authors proposed a model with the sim- 
ilar idea, yet they built a Gaussian Graphical Model instead of Bayesian 
networks, of module networks. In their study of the yeast (Saccharomyces 
cerevisiae) genes measured under 79 different conditions, the 2467 genes 
are first classified into 34 clusters by a hierarchical clustering analysis [Ho- 
rimoto and Toh, 2001]. Then the expression levels of the genes in each 
cluster are averaged for each condition. The averaged expression profile 
data of 34 clusters were subjected to GGM, and a partial correlation co- 
efficient matrix was obtained as a model of the genetic network. 

• Probabilistic Relational Models: A probabilistic relational model [Fried- 
man ct al., 1999a] is a probabilistic description of the relational models, 
like the models in relational databases. A relational model consists of a set 
of classes and a set of relations. Each entity type is associated with a set of 
attributes. Each attribute takes on values in some fixed domain of values. 
Each relation is typed. The probabilistic relational model describes the 
relationships between entities and the properties of entities. The model 
consists of two components: the qualitative dependency structure which 
is a DAG, and the parameters associated with it. The dependency struc- 
ture is defined by associating with each attribute and its parents, which 
is modeled as conditional probabilities. 

1.6 Network Topology 

Two classes of network architectures are of special interest to system biology [Ki- 
tano, 2002]: the small world networks [Watts and Strogatz, 1998] and scall-free 
power law networks [Barabasi and Albert, 1999]. Small world networks are 
characterized by high clustering coefficients and small diameters. The clus- 
tering coefficient C(p) is defined as follows. Suppose that a vertex v has k v 
neighbors; then at most k v (k v — l)/2 edges can exist between them (this occurs 
when every neighbor of v is connected to every other neighbor of v). Let C v de- 
note the fraction of these allowable edges that actually exist, then the clustering 
coefficient C is defined as the average of C v over all v. 

These properties reflect the existence of local building blocks together with 
long-range connectivity. Most nodes in small world networks have approxi- 
mately the same number of links, and the degree distribution P{k) decays ex- 
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ponentially for large k. Compared to small world networks, the scale-free power 
law networks have smaller clustering coefficients and large diameters. Most 
nodes in the scale-free networks are connected to a few neighbors, and only a 
small number of nodes, which is often called "hubs" , are connected to a large 
number of nodes. This property is reflected by the power law for the degree 
distribution P(k) ~ k~ v . 

Previous studies have found that a number of network structures appear to 
have structures between the small-world network and the scale-free network. In 
fact, these networks behave more like hierarchical scale-free [Han et al., 2004, 
Jeong et al., 2000, Lukashin et al., 2003, Basso ct al., 2005, Bhan et al., 2002, 
Ravasz et al. , 2002] . Nodes within the networks are first grouped into modules, 
whose connectivity is more like the small worlds network. The grouped modules 
are then connected into a large network, which follows the degree distribution 
that is similar to that of the scale-free network. 

1.7 Structure Learning of Graphical Models 

There are three major approaches of existing structure learning methods: constraint- 
based approaches, score-based approaches and regression-based approaches. 

Constraint-based approaches first attempt to identify a set of conditional 
independence properties, and then attempt to identify the network structure 
that best satisfies these constraints. The drawback with the constraints based 
approaches is that it is difficult to reliably identify the conditional independence 
properties and to optimize the network structure [Margaritis, 2003]. Plus, the 
constraints-based approaches lack an explicit objective function and they do 
not try to directly find the globally optimal structure. So they do not fit in the 
probabilistic framework. 

Score-based approaches first define a score function indicating how well the 
network fits the data, then search through the space of all possible structures 
to find the one that has the optimal value for the score function. Problem with 
this approach is that it is intractable to evaluate the score for all structures, so 
usually heuristics, like greedy search, are used to find the sub-optimal structures. 
Regression-based approaches are gaining popularity in recent years. Algorithms 
in this category are essentially optimization problems which guarantees global 
optimum for the objective function, and have better scalability. 

Regression-based approaches are gaining popularity in recent years. Algo- 
rithms in this category are essentially optimization problems which guarantees 
global optimum for the objective function, and have better scalability. 
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Chapter 2 



Constraint-based 
Algorithms 

The constraint-based approaches [Tsamardinos et al., 2006, Juliane and Ko- 
rbinian, 2005, Spirtcs et al., 2000, Wille et al., 2004, Margaritis, 2003, Margari- 
tis and Thrun, 1999] employ the conditional independence tests to first identify 
a set of conditional independence properties, and then attempts to identify the 
network structure that best satisfies these constraints. The two most popular 
constraint-based algorithm are the SGS algorithm and PC algorithm [Tsamardi- 
nos et al., 2006], both of which tries to d-separate all the variable pairs with all 
the possible conditional sets whose sizes are lower than a given threshold. 

One problem with constraint-based approaches is that they are difficult to 
reliably identify the conditional independence properties and to optimize the 
network structure [Margaritis, 2003]. The constraint-based approaches lack an 
explicit objective function and they do not try to directly find the global struc- 
ture with maximum likelihood. So they do not fit in the probabilistic framework. 

2.1 The SGS Algorithm 

The SGS algorithm (named after Spirtes, Glymour and Scheines) is the most 
straightforward constraint-based approach for Bayesian network structure learn- 
ing. It determines the existence of an edge between every two node variables by 
conducting a number of independence tests between them conditioned on all the 
possible subsets of other node variables. The pseudo code of the SGS algorithm 
is listed in Algorithm 1. After slight modification, SGS algorithm can be used 
to learn the structure of undirected graphical models (Markov random fields). 

The SGS algorithm requires that for each pair of variables adjacent in G, 
all possible subsets of the remaining variables should be conditioned. Thus this 
algorithm is super-exponential in the graph size (number of vertices) and thus 
unscalable. The SGS algorithm rapidly becomes infeasible with the increase 
of the vertices even for sparse graphs. Besides the computational issue, the 
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Algorithm 1 SGS Algorithm 
1: Build a complete undirected graph H on the vertex set V . 
2: For each pair of vertices i and j, if there exists a subset S of V \ {i,j} such 
that i and j are d-separated given S, remove the edge between i and j from 
G. 

3: Let G' be the undirected graph resulting from step 2. For each triple of 
vertices i, j and k such that the pair i and j and the pair j and k are each 
adjacent in G' (written as i — j — k) but the pair i and k are not adjacent 
in G', orient i — j — k as i — > j <— fc if and only if there is no subset S of 
{j} U V \ {i, j} that d-separate i and k. 

4: repeat 

5: If i — ^ j, j and k are adjacent, i and k are not adjacent, and there is no 

arrowhead at j, then orient j — k as j — > k. 
6: If there is a directed path from i to j, and an edge between i and j, then 

orient i — j as i — > j. 
7: until no more edges can be oriented. 



SGS algorithm has problems of reliability when applied to sample data, because 
determination of higher order conditional independence relations from sample 
distribution is generally less reliable than is the determination of lower order 
independence relations. 

2.2 The PC Algorithm 

The PC algorithm (named after Peter Spirtes and Clark Glymour) is a more 
efficient constraint-based algorithm. It conducts independence tests between all 
the variable pairs conditioned on the subsets of other node variables that are 
sorted by their sizes, from small to large. The subsets whose sizes are larger 
than a given threshold are not considered. The pseudo-code of the PC algorithm 
is given in Algorithm 2. We use to denote the adjacent vertices to vertex 
j in a directed acyclic graph G. 

The complexity of the PC algorithm for a graph G is bounded by the largest 
degree in G. Suppose d is the maximal degree of any vertex and n is the 
number of vertices. In the worst case the number of conditional independence 
tests required by the PC algorithm is bounded by 




The PC algorithm can be applied on graphs with hundreds of nodes. How- 
ever, it is not scalable if the number of nodes gets even larger. 
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Algorithm 2 PC Algorithm 
1: Build a complete undirected graph G on the vertex set V . 

2: n = 0. 

3: repeat 
4: repeat 

5: Select an ordered pair of vertices i and j that are adjacent in G such 
that Af(i) \ {j} has cardinality greater than or equal to n, and a subset 
S of Af(i) \ {j} of cardinality n, and if i and j are d-separated given S 
delete edge i — j from G and record S in Sepset(i,j) and Sepset(j,i). 
6: until all ordered pairs of adjacent variables i and j such that 7V(i)\{j}has 
cardinality greater than or equal to n and all subsets S of Af(i) \ {j}of 
cardinality n have been tested for d-separation. 
7: n = n + 1. 

8: until for each ordered pair of adjacent vertices i and j, Af(i) \ {j} is of 
cardinality less than n. 

9: For each triple of vertices i, j and k such that the pair i, j and the pair 
j, k are each adjacent in G but the pair i, k are not adjacent in G, orient 
i — j — k as i — > j <— fc if and only if j is not in Sepset(i, k). 
10: repeat 

11: If i — > j, j and fc are adjacent, i and fc are not adjacent, and there is no 

arrowhead at j, then orient j — k as j — > k. 
12: If there is a directed path from i to j, and an edge between i and j, then 

orient i — j as i — >• j. 
13: until no more edges can be oriented. 



2.3 The GS Algorithm 

Both the SGS and PC algorithm start from a complete graph. When the number 
of nodes in the graph becomes very large, even PC algorithm will be intractable 
due to the large combinatorial space. 

In Margaritis and Thrun [1999], the authors proposed a Crow-Shrinkage 
(GS) algorithm to address the large scale network structure learning problem 
by exploring the sparseness of the graph. The GS algorithm use two phases to 
estimate a superset of the Markov blanket d(j) for node j as in Algorithm 3. In 
the pseudo code, i j denotes that node i and j are dependent conditioned 
on set S. 

Algorithm 3 includes two phases to estimate the Markov blanket. In the 
"grow" phase, variables are added to the Markov blanket d(j) sequentially using 
a forward feature selection procedure, which often results in a superset of the real 
Markov blanket. In the "shrinkage" phase, variables are deleted from the d(j) if 
they are independent from the target variable conditioned on the subset of other 
variables in d(j). Given the estimated Markov blanket, the algorithm then tries 
to identify both the parents and children for each variable as in Algorithm 4. 

In [Margaritis and Thrun, 1999] , the authors further developed a randomized 
version of the GS algorithm to handle the situation when 1) the Markov blanket 
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Algorithm 3 GS: Estimating the Markov Blanket 

1: S <- W. 

2: while 3 j G V \ {i} such that j ^ s i do 
3: S^SU{j}. 
4: end while 

5: while 3j G S such that j ^s\{i} i do 

6: S^S\{j}. 

7: end while 

8: d(i) «- S* 



Algorithm 4 GS Algorithm 
1: Compute Markov Blankets: for each vertex i G V compute the Markov 
blanket <9(i). 

2: Compute Graph Structure: for all i G V and j G determine j to be a 
direct neighbor of i if i and j are dependent given S for all S C T where T 
is the smaller of d(i) \ {j} and \ {i}. 

3: Orient Edges: for all z G V" and j G <9(i), orient j — > i if there exists a 
variable k G \ {d(j) U {j}} such that j and fc are dependent given 
S U {i} for all S C [/ where {/ is the smaller of \ {fc} and \ {j}. 

4: repeat 

5: Compute the set of edges C = {i-> jsuch thati — > j is part of a cycle}. 
6: Remove the edge in C that is part of the greatest number of cycles, and 

put it in R. 
7: until there is no cycle exists in the graph. 

8: Reverse Edges: Insert each edge from R in the graph, reversed. 

9: Propagate Directions: for all j € V and j G d(i) such that neither j — > i nor 

i — > j, execute the following rule until it no longer applies: if there exists a 

directed path from i to j, orient i — > j. 



is relatively large, 2) the number of training samples is small compared to the 
number of variables, or there are noises in the data. 

In a sparse network in which the Markov blankets are small, the complexity 
of GS algorithm is 0(n 2 ) where n is the number of nodes in the graph. Note 
that GS algorithm can be used to learn undirected graphical structures (Markov 
Random Fields) after some minor modifications. 
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Chapter 3 

Score-based Algorithms 



Score-based approaches [Heckcrman et al., 1995, Friedman et al., 1999b, Hartemink 
et al., 2001] first posit a criterion by which a given Bayesian network structure 
can be evaluated on a given dataset, then search through the space of all possi- 
ble structures and tries to identify the graph with the highest score. Most of the 
score-based approaches enforce sparsity on the learned structure by penalizing 
the number of edges in the graph, which leads to a non-convex optimization 
problem. Score-based approaches are typically based on well established statis- 
tical principles such as minimum description length (MDL) [Lam and Bacchus, 
1994, Friedman and Goldszmidt, 1996, Allen and Greiner, 2000] or the Bayesian 
score. The Bayesian scoring approaches was first developed in [Cooper and 
Herskovits, 1992], and then refined by the BDe score [Heckerman et al., 1995], 
which is now one the of best known standards. These scores offer sound and well 
motivated model selection criteria for Bayesian network structure. The main 
problem with score based approaches is that their associated optimization prob- 
lems are intractable. That is, it is NP-hard to compute the optimal Bayesian 
network structure using Bayesian scores [Chickering, 1996]. Recent researches 
have shown that for large samples, optimizing Bayesian network structure is NP- 
hard for all consistent scoring criteria including MDL, BIC and the Bayesian 
scores [Chickering et al., 2004]. Since the score-based approaches are not scal- 
able for large graphs, they perform searches for the locally optimal solutions in 
the combinatorial space of structures, and the local optimal solutions they find 
could be far away from the global optimal solutions, especially in the case when 
the number of sample configurations is small compared to the number of nodes. 

The space of candidate structures in scoring based approaches is typically 
restricted to directed models (Bayesian networks) since the computation of typi- 
cal score metrics involves computing the normalization constant of the graphical 
model distribution, which is intractable for general undirected models [Pollard, 
1984]. Estimation of graph structures in undirected models has thus largely 
been restricted to simple graph classes such as trees [Chow et al., 1968], poly- 
trees [Chow et al., 1968] and hypertrees [Srebro, 2001]. 
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3.1 Score Metrics 



3.1.1 The MDL Score 



The Minimum Description Length (MDL) principle [Rissancn, 1989] aims to 
minimize the space used to store a model and the data to be encoded in the 
model. In the case of learning Bayesian network B which is composed of a graph 
G and the associated conditional probabilities P Bl the MDL criterion requires 
choosing a network that minimizes the total description length of the network 
structure and the encoded data, which implies that the learning procedure bal- 
ances the complexity of the induced network with the degree of accuracy with 
which the network represents the data. 

Since the MDL score of a network is defined as the total description length, 
it needs to describe the data U, the graph structure G and the conditional 
probability P for a Bayesian network B = (G, P). 

To describe U, we need to store the number of variables n and the cardinality 
of each variable xi. We can ignore the description length of U in the total 
description length since U is the same for all candidate networks. 

To describe the DAG G, we need to store the parents ir(i) of each variable 
Xi. This description includes the number of parents 1 7r(i) | and the index of the 
set ir(i) in some enumeration of all se ts of this cardinality. Since the 

number of parents |7r(i)| can be encoded in logn bits, and the indices of all 
parents of node i can be encoded in log ( 7r " i j) bits, the description length of the 
graph structure G is 



To describe the conditional probability P in the form of CPD, we need to store 
the parameters in each conditional probability table. The number of param- 
eters used for the table associated with Xi is |7r(i)|(|a;j| — 1). The description 
length of these parameters depends on the number of bits used for each numeric 
parameter. A usual choice is 1/2 log N [Friedman and Goldszmidt, 1996]. So 
the description length for x,'s CPD is 



To describe the encoding of the training data, we use the probability measure 
defined by the network B to construct a Huffman code for the instances in D. 
In this code, the length of each codeword depends on the probability of that 
instance. According to [Cover and Thomas, 1991], the optimal encoding length 
for instance Xi can be approximated as — log P Xi . So the description length of 




(3.1) 



DL tab (xi) = - 



n(i)\(\ Xi \-l)logN 
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the data is 

N 

DL data {D\B) = -^logPfc) 

i=l 

= #(^,x 7r(l) )logP(a; 4 |x 7r(4) ). 

i x i ,x 7r(i ) 

In the above equation, (i^x,,^) is a local configuration of variable Xi and 
its parents, #(xi,x 7r ( i )) is the number of the occurrence of this configuration 
in the training data. Thus the encoding of the data can be decomposed as the 
sum of terms that are "local" to each CPD, and each term only depends on the 
counts #(a; i ,x w ( i )). 

If P(x i |x 7r ( i )) is represented as a table, then the parameter values that mini- 
mize DLdata{D\B) are 6 Xi \ x ^ ( .. — P(xi\x 7T ^ i - ) ) [Friedman and Goldszmidt, 1998]. 
If we assign parameters accordingly, then DL data (D\B) can be rewritten in terms 
of conditional entropy as N J2i H{ x i\ x w(i))i where 

H{X\Y) = -^2P(x,y) log P(x\y) 

x,y 

is the conditional entropy of X given Y . The new formula provides an information- 
theoretic interpretation to the representation of the data: it measures how many 
bits are necessary to encode the values of Xi once we know x T (,). 

Finally, by combining the description lengths above, we get the total de- 
scription length of a Bayesian network as 

DL(G, D) = DL graph (G) + ]T DL tab ( Xi ) + H{ Xi \^ {i) ) (3.2) 

i i 



3.1.2 The BDe Score 

The Bayesian score for learning Bayesian networks can be derived from methods 
of Bayesian statistics, one important example of which is BDe score [Cooper and 
Herskovits, 1992, Heckerman et al., 1995]. The BDe score is proportional to the 
posterior probability of the network structure given the data. Let G h denote the 
hypothesis that the underlying distribution satisfies the independence relations 
encoded in G. Let &g represent the parameters for the CPDs qualifying G. By 
Bayes rule, the posterior probability P(G h \D) is 

P[G lD) - P(D) 

In the above equation, 1/P(D) is the same for all hypothesis, and thus we 
denote this constant as a. The term P(D\G h ) is the probability given the 
network structure, and P(G h ) is the prior probability of the network structure. 
They are computed as follows. 
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The prior over the network structures is addressed in several literatures. 
In [Heckerman et al., 1995], this prior is chosen as P{G h ) oc a A ( G ' G \ where 
A(G, G') is the difference in edges between G and a prior network structure G', 
and < a < 1 is the penalty for each edge. In [Friedman and Goldszmidt, 1998], 
this prior is set as P(G h ) oc 2~ DL 9 ra P h ( G \ where DL grap h(G) is the description 
length of the network structure G, defined in Equation 3.1. 

The evaluation of P(D\G h ) needs to consider all possible parameter assign- 
ments to G, namely 

p(D\G h ) = J p(D\e G ,G h )p(e G \G h )de G , (3.3) 

where P(D\Qc,G h ) is the probability of the data given the network structure 
and parameters. P(<d G \G h ) is the prior probability of the parameters. Under 
the assumption that each distribution P(xi\x w ^) can be learned independently 
of all other distributions [Heckerman et al., 1995], Equation 3.3 can be written 

as 

p(D\G h )=nn f nCw x ' (i>)p (^^i Gft )^^- 

Note that this decomposition is analogous to the decomposition in Equation 3.2. 
In [Heckerman et al., 1995], the author suggested that each multinomial distri- 
bution takes a Dirichlet prior, such that 



P(6x) = /3n# 

X 

where N' x : x € Val(X) is a set of hyper parameters, (3 is a normalization 
constant. Thus, the probability of observing a sequence of values of X with 
counts N(x) is 

j\[e x P (o x \G )do x - f^jNTTmr) 1 } m> 

where is the Gamma function defined as 

/■OO 

F(x) = / t x - l e- l At 
Jo 

The Gamma function has the following properties: 

r(i) = i 

T(x + 1) = xT(x) 
If we assign each 9» )7r (») a Dirichlet prior with hyperparameters N then 

p(n\r h ) = 17 17 r ^ iJv ^f^ tt E^kiLj!Mjj)j 

\ u > 1111 IYV.JV' +JV(7r(t))) 11 T(N! , -0 
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3.1.3 Bayesian Information Criterion (BIC) 

A natural criterion that can be used for model selection is the logarithm of the 
relative posterior probability: 

log P(D, G) = log P(G) + log P(D\G) (3.4) 

Here the logarithm is used for mathematical convenience. An equivalent 
criterion that is often used is: 



P(G„\D)) "\P(G )J B \P{D\G a ) 

The ratio P(D\G) / P(D\Gq) in the above equation is called B ayes factor [Kass 
and Raftery, 1995]. Equation 3.4 consists of two components: the log prior of 
the structure and the log posterior probability of the structure given the data. 
In the large-sample approximation we drop the first term. 

Let us examine the second term. It can be expressed by marginalizing all 
the assignments of the parameters 9 of the network: 

log p(d\g) = io g / p(L>|G,e)p(e|G)de (3.5) 

Je 

In [Kass and Raftery, 1995], the authors proposed a Gaussian approximation 
for P(Q\D, G) oc P{D\Q, G)P(0|G) for large amounts of data. Let 

ff (e) = io g (P(D|e,G)P(e|G)) 

We assume that is the maximum a posteriori (MAP) configuration of 
for P(Q\D,G), which also maximizes <?(©). Using the second degree Taylor 
series approximation of g(Q) at 0: 

.9(0) « fl (0) -1(9- 0)A(0 -0) T 

Where A is the negative Hessian of <?(0) at 0. Thus we get 
P(B\D,G) oc P(D\e,G)P(Q,G) 

» P(r,|0,S)P(0|S)expQ(0-0)A(0-0) T ) (3.6) 

So we approximate P(Q\D, G) as a multivariate Gaussian distribution. Plugging 
Equation 3.6 into Equation 3.5 and we get: 

log P(D\G) « log P(D\Q, G) + log P(0|G) + d - log(27r) - 1 log \A\ (3.7) 

where d is the dimension of g(0). In our case it is the number of free parameters. 

Equation 3.7 is called a Laplace approximation, which is a very accurate 
approximation with relative error 0(1 /N) where N is the number of samples in 
D [Kass and Raftery, 1995]. 
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However, the computation of \A\ is a problem for large-dimension models. 
We can approximate it using only the diagonal elements of the Hessian A, in 
which case we assume independencies among the parameters. 

In asymptotic analysis, we get a simpler approximation of the Laplace ap- 
proximation in Equation 3.7 by retaining only the terms that increase with 
the number of samples N: logP(D\Q,G) increases linearly with TV; log \A\ in- 
creases as dlogN. And G can be approximated by the maximum likelihood 
configuration G. Thus we get 

logP(D\G)^P(D\Q,S)-^logN (3.8) 

This approximation is called the Bayesian Information Criterion (BIC) [Schwarz, 
1978]. Note that the BIC does not depend on the prior, which means we can 
use the approximation without assessing a prior. The BIC approximation can 
be intuitively explained: in Equation 3.8 , log P(D\Q, G) measures how well the 
parameterized structure predicts the data, and (d/2\ogN) penalizes the com- 
plexity of the structure. Compared to the Minimum Description Length score 
defined in Equation 3.2, the BIC score is equivalent to the MDL except term of 
the description length of the structure. 

3.2 Search for the Optimal Structure 

Once the score is defined, the next task is to search in the structure space 
and find the structure with the highest score. In general, this is an NP-hard 
problem [Chickering, 1996]. 

Note that one important property of the MDL score or the Bayesian score 
(when used with a certain class of factorized priors such as the BDc priors) is 
the decompos ability in presence of complete data, i.e., the scoring functions we 
discussed earlier can be decomposed in the following way: 

Scorc(G : D)=y~] Scored |x T(i) : N Xi ^ {i) ) 

i 

where N Xi ^ {i) is the number of occurrences of the configuration Xi,x n ^y 

The decomposability of the scores is crucial for score-based learning of struc- 
tures. When searching the possible structures, whenever we make a modifica- 
tion in a local structure, we can readily get the score of the new structure by 
re-evaluating the score at the modified local structure, while the scores of the 
rest part of the structure remain unchanged. 

Due to the large space of candidate structures, simple search would in- 
evitably leads to local maxima. To deal with this problem, many algorithms 
were proposed to constrain the candidate structure space. Here they are listed 
as follows. 
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Algorithm 5 Hill-climbing search for structure learning 
1: Initialize a structure G' . 
2: repeat 
3: Set G = G'. 

4: Generate the acyclic graph set Neighbor (G) by adding, removing or re- 
versing an edge in graph G. 
5: Choose from Neighbor(G) the one with the highest score and assign to 



6: until Convergence. 



3.2.1 Search over Structure Space 

The simplest search algorithm over the structure is the greedy hill-climbing 
search [Hcckerman ct al., 1995]. During the hill-climbing search, a series of 
modifications of the local structures by adding, removing or reversing an edge are 
made, and the score of the new structure is reevaluated after each modification. 
The modifications that increase the score in each step is accepted. The pseudo- 
code of the hill-climbing search for Bayesian network structure learning is listed 
in Algorithm 5. 

Besides the hill-climbing search, many other heuristic searching methods 
have also been used to learn the structures of Bayesian networks, including the 
simulated annealing [Chickering and Boutilier, 1996], best-first search [Russel 
and Norvig, 1995] and genetic search [Larranaga et al., 1996]. 

A problem with the generic search procedures is that they do not exploit the 
knowledge about the expected structure to be learned. As a result, they need to 
search through a large space of candidate structures. For example, in the hill- 
climbing structure search in Algorithm 5, the size of Neighbor (G) is 0(n 2 ) where 
n is the number of nodes in the structure. So the algorithm needs to compute the 
scores of 0(n 2 ) candidate structures in each update (the algorithm also need 
to check acyclicity of each candidate structure), which renders the algorithm 
unscalable for large structures. 

In [Friedman et al., 1999b], the authors proposed a Sparse Candidate Hill 
Climbing (SCHC) algorithm to solve this problem. The SCHC algorithm first 
estimates the possible candidate parent set for each variable and then use hill- 
climbing to search in the constrained space. The structure returned by the 
search can be used in turn to estimate the possible candidate parent set for 
each variable in the next step. 

The key in SCHC is to estimate the possible parents for each node. Early 
works [Chow et al., 1968, Sahami, 1996] use mutual information to determine 
if there is an edge between two nodes: 



G' . 




where P(-) is the observed frequencies in the dataset. A higher mutual in- 
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formation indicates a stronger dependence between X and Y . Yet this measure 
is not suitable be used to determine the existence of an edge between two nodes 
has problems because, for example, it does not consider the information that we 
already learnt about the structure. Instead, Friedman et al. [1999b] proposed 
two other metrics to evaluate the dependency of two variables. 

• The first metric is based on an alternative definition of mutual informa- 
tion. The mutual information between X and Y is defined as the distance 
between the joint distribution of P(X, Y) and the distribution P(X)P(Y), 
which assumes the independency of the two variables: 

I(X-Y) = D KL (P{X,Y)\\P{X)P(Y)) 
where D KL {P\\Q) is the Kullback-Leibler divergence defined as: 
D KL (P(X)\\Q(X)) = £ P(X) log g|| 

Under this definition, the mutual information measures the error we in- 
troduce if we assume the independency of X and Y . During each step 
of the search process, we already have an estimation of the network B. 
To utilize this information, similarly, we measure the discrepancy between 
the estimation Pg(X,Y) and the empirical estimation P(X,Y) as: 

D KL (P(X)\\Q(X)) = £ P(X) log 

One issue with this measure is that it requires to compute Pg(Xj,l^) for 
pairs of variables. When learning networks over large number of variables 
this can be computationally expensive. However, one can easily approxi- 
mate these probabilities by using a simple sampling approach. 

• The second measure utilizes the Markov property that each node is in- 
dependent of other nodes given its Markov blanket. First the conditional 
mutual information is defined as: 

I(X- Y\Z) = J2 P(Z)D KL (P(X, Y\Z)\\P(X\Z)P(Y\Z)). 

z 

This metric measures the error that is introduced when assuming the 
conditional independence of X and Y given Z. Based upon this, another 
metric is defined as: 

M s hieid(Xi,Xj\B) — I(Xi;Xj\X n ^) 

Note that using either of these two metrics for searching, at the beginning of 
the search, i.e., Bq is an empty network, the measure is equivalent to I(X;Y). 
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Later iterations will incorporate the already estimated network structure in 
choosing the candidate parents. 

Another problem with the hill-climbing algorithm is the stopping criteria for 
the search. There are usually two types of stopping criteria: 

• Score-based criterion: the search process terminates when Score(Bt) = 
Score(Bt-i). In other words, the score of the network can no longer be 
increased by updating the network from candidate network space. 

• Candidate-based criterion: the search process terminates when C\ — C\~ x 
for all i, that is, the candidate space of the network remains unchanged. 

Since the score is a monotonically increasing bounded function, the score- 
based criterion is guaranteed to stop. The candidate-based criterion might enter 
a loop with no ending, in which case certain heuristics are needed to stop the 
search. 

There are four problems with the SCHC algorithm. First, the estimation of 
the candidate sets is not sound (i.e., may not identify the true set of parents), 
and it may take a number of iterations to converge to an acceptable approx- 
imation of the true set of parents. Second, the algorithm needs a pre-defined 
parameter k, the maximum number of parents allowed for any node in the net- 
work. If k is underestimated, there is a risk of discovering a suboptimal network. 
On the other hand, if k is overestimated, the algorithm will include unnecessary 
parents in the search space, thus jeopardizing the efficiency of the algorithm. 
Third, as already implied above, the parameter k imposes a uniform sparseness 
constraint on the network, thus may sacrifice cither efficiency or quality of the 
algorithm. A more efficient way to constrain the search space is the Max-Min 
Hill-Climbing (MMHC) algorithm [Tsamardinos et al., 2006], a hybrid algorithm 
which will be explained in Section 5.1. The last problem is that the constraint 
of the maximum number of parents k will conflict with the scale-free networks 
due to the existence of hubs (this problem exists for any algorithm that imposes 
this constraint). 

Using the SCHC search, the number of candidate structures in each update 
is reduced from 0(n 2 ) to 0(n) where n is the number of nodes in the structure. 
Thus, the algorithm is capable to learn large-scale structures with hundreds of 
nodes. 

The hill-climbing search is usually applied with multiple restarts and tabu 
list [Cvijovicacute and Klinowski, 1995]. Multiple restarts are used to avoid 
local optima, and the tabu list is used to record the path of the search so as to 
avoid loops and local minima. 

To solve the problem of large candidate structure space and local optima, 
some other algorithms are proposed as listed in the following. 

• In [Moore and keen Wong, 2003], the authors proposed a search strategy 
based on a more complex search operator called optimal reinsertion. In 
each optimal reinsertion, a target node in the graph is picked and all arcs 
entering or exiting the target are deleted. Then a globally optimal com- 
bination of in-arcs and out-arcs are found and reinserted into the graph 
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subject to some constraints. With the optimal reinsertion operation de- 
fined, the search algorithm generates a random ordering of the nodes and 
applies the operation to each node in the ordering in turn. This proce- 
dure is iterated, each with a newly randomized ordering, until no change 
is made in a full pass. Finally, a conventional hill-climbing is performed to 
relax the constraint of max number of parents in the optimal reinsertion 
operator. 

• In [Xiang et al., 1997], the authors state that with a class of domain 
models of probabilistic dependency network, the optimal structure can not 
be learned through the search procedures that modify a network structure 
one link at a time. For example, given the XOR nodes there is no benefit in 
adding any one parent individually without the others and so single-link 
hill-climbing can make no meaningful progress. They propose a multi- 
link lookahead search for finding decomposable Markov Networks (DMN). 
This algorithm iterates over a number of levels where at level i, the current 
network is continually modified by the best set of i links until the entropy 
decrement fails to be significant. 

• Some algorithms identify the Markov blanket or parent sets by either 
using conditional independency test, mutual information or regression, 
then use hill-climbing search over this constrained candidate structure 
space [Tsamardinos et al., 2006, Schmidt and Murphy, 2007]. These al- 
gorithms belong to the hybrid methods. Some of them are listed in Sec- 
tion 5.1. 

3.2.2 Search over Ordering Space 

The acyclicity of the Bayesian network implies an ordering property of the struc- 
ture such that if we order the variables as (xi, • • • , x„), each node x, would have 
parents only from the set {xi, • • • , Xj_i}. Fundamental observations [Buntine, 
1991, Cooper and Hcrskovits, 1992] have shown that given an ordering on the 
variables in the network, finding the highest-scoring network consistent with the 
ordering is not NP-hard. Indeed, if the in-degree of each node is bounded to k 
and all structures are assumed to have equal probability, then this task can be 
accomplished in time 0(n k ) where n is the number of nodes in the structure. 

Search over the ordering space has some useful properties. First, the order- 
ing space is significantly smaller than the structure space: 2°(™ log ™) orderings 
versus 2 n (™ ) structures where n is the number of nodes in the structure [Robin- 
son, 1973]. Second, each update in the ordering search makes a more global 
modification to the current hypothesis and thus has more chance to avoid local 
minima. Third, since the acyclicity is already implied in the ordering, there is 
no need to perform acyclicity checks, which is potentially a costly operation for 
large networks. 

The main disadvantage of ordering-based search is the need to compute a 
large set of sufficient statistics ahead of time for each variable and each possible 
parent set. In the discrete case, these statistics are simply the frequency counts 
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of instantiations: for each Xi € Val(xi) and x,^ G Va^x,^)). This 

cost would be very high if the number of samples in the dataset is large. How- 
ever, the cost can be reduced by using AD-tree data structure [Moore and Lee, 
1998], or by pruning out possible parents for each node using SCHC [Friedman 
et al., 1999b], or by sampling a subset of the dataset randomly. 

Here some algorithms that search through the ordering space are listed: 

• The ordering-based search was first proposed in [Larranaga et al., 1996] 
which uses a genetic algorithm search over the structures, and thus is very 
complex and not applicable in practice. 

• In [Friedman and Koller, 2003], the authors proposed to estimate the 
probability of a structural feature (i.e., an edge) over the set of all orderings 
by using a Markov Chain Monte Carlo (MCMC) algorithm to sample over 
the possible orderings. The authors asserts that in the empirical study, 
different runs of MCMC over network structure typically lead to very 
different estimates in the posterior probabilities over network structure 
features, illustrating poor convergence to the stationary distribution. By 
contrast, different runs of MCMC over orderings converge reliably to the 
same estimates. 

• In [Teyssier and Koller, 2005], the authors proposed a simple greedy local 
hill-climbing with random restarts and a tabu list. First the score of an 
ordering is defined as the score of the best network consistent with it. 
The algorithm starts with a random ordering of the variables. In each 
iteration, a swap operation is performed on any two adjacent variables 
in the ordering. Thus the branching factor for this swap operation is 
0(n). The search stops at a local maximum when the ordering with the 
highest score is found. The tabu list is used to prevent the algorithm 
from reversing a swap that was executed recently in the search. Given an 
ordering, the algorithm then tries to find the best set of parents for each 
node using the Sparse Candidate algorithm followed by exhaustive search. 

• In [Koivisto, 2004, Singh and Moore, 2005], the authors proposed to use 
Dynamic Programming to search for the optimal structure. The key in 
the dynamic programming approach is the ordering -<!, and the marginal 
posterior probability of the feature / 

p(f \ ^) = J2p^ \ x )P(f\ x ^) 

-< 

Unlike [Friedman and Koller, 2003] which uses MCMC to approximate 
the above value, the dynamic programming approach does exact summa- 
tion using the permutation tree. Although this approach may find the 
exactly best structure, the complexity is 0(n2 n + n k+1 C(m)) where n is 
the number of variables, k is a constant in-degree, and C(m) is the cost 
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of computing a single local marginal conditional likelihood for m data in- 
stances. The authors acknowledge that the algorithm is feasible only for 
n < 26. 
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Chapter 4 



Regression-based 
Algorithms 

4.1 Regression Model 

Given N sample data points as (xi,yi) and pre-defined basis functions </>(•); the 
task of regression is to find a set of weights w such that the basis functions 
give the best prediction of the label yi from the input a;,. The performance 
of the prediction is given by an loss function Eo(w). For example, in a linear 
regression, 

1 N 

^(w) = -^(y 4 -w T ^(x,)) 2 (4.1) 

To avoid over-fitting, a regularizer is usually added to penalize the weights w. 
So the regularized loss function is: 

E(w) = E D (w) + XE w (w) (4.2) 

The regularizer penalizes each element of w: 

M 

E w (w) =^0^11^11, 
When all a^'s are the same, then 

E W (W) = \\WjWg 

where || • || g is the L q norm, A is the rcgularization coefficient that controls 
the relative importance of the data- dependent error and the regularization term. 
With different values of q, the regularization term may give different results: 
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1. When q — 2, the regularizer is in the form of sum-of-squares 



E w (w) 



J- T 

= -w w 
2 



This particular choice of regularizer is known in machine learning liter- 
ature as weight decay [Bishop, 2006] because in sequential learning al- 
gorithm, it encourages weight values to decay towards zeros, unless sup- 
ported by the data. In statistics, it provides an example of a parameter 
shrinkage method because it shrinks parameter values towards zero. 

One advantage of the L 2 regularizer is that it is rotationally invariant in 
the feature space. To be specific, given a deterministic learning algorithm 
L, it is rotationally invariant if, for any training set S, rotational matrix 
M and test example x, there is L[S](x) = L[MS](Mx). More generally, if 
L is a stochastic learning algorithm so that its predictions are random, it 
is rotationally invariant if, for any S, M and x, the prediction L[S](x) and 
L[MS](Mx) have the same distribution. A complete proof in the case of 
logistic regression is given in [Ng, 2004]. 

This quadratic (X 2 ) regularizer is convex, so if the loss function being 
optimized is also a convex function of the weights, then the regularized 
loss has a single global optimum. Moreover, if the loss function Ed(w) 
is in quadratic form, then the minimizer of the total error function has 
a closed form solution. Specifically, if the data-dependent error Ed(w) 
is the sum-of-squares error as in Equation 4.2, then setting the gradient 
with respect to w to zero, then the solution is 



This regularizer is seen in ridge regression [Hoerl and Kennard, 2000], the 
support vector machine [Hoerl and Kennard, 2000, Schlkopf and Smola, 
2002] and regularization networks [Girosi et al., 1995]. 

2. q = 1 is called lasso regression in statistics [Tibshirani, 1996]. It has the 
property that if A is sufficiently large, then some of the coefficients u>i are 
driven to zero, which leads to a sparse model in which the corresponding 
basis functions play no role. To see this, note that the minimization of 
Equation 4.2 is equivalent to minimizing the unrcgularized sum-of-squares 
error subject to the constraint over the parameters: 



w= (AI + * T *) _1 $ T t 




(4.3) 



M 




(4.4) 
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Figure 4.1: Contours of the unregularized objective function (blue) along with 
the constraint region (yellow) with L2-regularizer (left) and Li-regularizer. The 
lasso regression gives a sparse solution. Figure excerpted from [Bishop, 2006]. 

The Lagrangian of Equation 4.3 gives Equation 4.2. The sparsity of the so- 
lution can be seen from Figure 4.1. Theoretical study has also shown that 
lasso Li regularization may effectively avoid over-fitting. In [Dudk et al., 
2004], it is shown that the density estimation in log-linear models using 
Li-regularized likelihood has sample complexity that grows only logarith- 
mically in the number of features of the log-linear model; Ng [2004] and 
Wainwright et al. [2006] show a similar result for L\ -regularized logistic 
regression respectively. 

The asymptotic properties of Lasso-type estimates in regression have been 
studied in detail in [Knight and Fu, 2000] for a fixed number of variables. 
Their results say that the regularization parameter A should decay for 
an increasing number of observations at least as fast as TV -1 / 2 to obtain 
A rl / 2 -consistent estimate where N is the number of observations. 

3. If 0° = is defined, then the Lq regularization contributes a fixed penalty 
at for each weight Wi ^ 0. If all on are identical, then this is equivalent 
to setting a limit on the maximum number of non-zero weights. How- 
ever, the Lq norm is not a convex function, and this tends to make exact 
optimization of the objective function expensive. 

In general, the L q norm has parsimonious property (with some components 
being exactly zero) for q < 1, while the optimization problem is only 
convex for q > 1. So L\ regularizer occupies a unique position, as q = 1 is 
the only value of q such that the optimization problem leads to a sparse 
solution, while the optimization problem is still convex. 
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4.2 Structure Learning through Regression 



Learning a graphical structure by regression is gaining popularity in recent 
years. The algorithms proposed mainly differ in the choice of the objective loss 
functions. They are listed in the following according to the different objectives 
they use. 

4.2.1 Likelihood Objective 

Methods in this category use the negative likelihood or log- likelihood of the data 
given the parameters of the model as the objective loss function -Ed(-). 

• In [in Lee et al., 2006], the authors proposed a L\ -regularized structure 
learning algorithm for Markov Random Field, specifically, in the frame- 
work of log-linear models. Given a variable set X = {xi, • • • ,x„}, a log- 
linear model is defined in terms of a set of feature functions /fc(xj,), each 
of which is a function that defines a numerical value for each assignment 
Xfc to some subset Xfe C X . Given a set of feature functions F — {fk}, 
the parameters of the log-linear model are weights 9 = {9k : fk € F}. The 
overall distribution is defined as: 



where Z(9) is a normalizer called partition function. Given an iid training 
dataset T>, the log-likelihood function is: 



where /fe(I?) = X)m=i fk(xk[m}) is the sum of the feature values over the 
entire data set, f (T>) is the vector where all of these aggregate features have 
been arranged in the same order as the parameter vector, and 9 T { (V) is a 
vector dot-product operation. To get a sparse MAP approximation of the 
parameters, a Laplacian parameter prior for each feature fk is introduced 
such that 




£{M,V) 



]T e k fk(T> - M log Z{6) = 9 T t(V) -M log Z(6) 



fk&F 



P(9 k ) = ?±exp(-0 k \6 k \) 



And finally the objective function is: 



max 
6 



9 J f \V) -M log Z (6) 



k 
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Before solving this optimization problem to get the parameters, features 
should be included into the model. Instead of including all features in 
advance, the authors use grafting procedure [Perkins et al., 2003] and 
gain-based method [Pietra et al., 1997] to introduce features into the model 
incrementally. 

• In [Wainwright et al., 2006], the authors restricted to the Ising model, a 
special family of MRF, defined as 

p(x, 9) = exp ^2 s x s + ^2 st x s x t - vp(0) 

\s£V (s,t)eE 

The logistic regression with Li-regularization that minimizing the negative 
log likelihood is achieved by optimizing: 

s ' A = argmin( - V(log(l+cxp(0 T z( J ^))-x«0 T z^)) +A„||0 V || 1 ) 

4.2.2 Dependency Objective 

Algorithms in this category use linear regression to estimate the Markov blanket 
of each node in a graph. Each node is considered dependent on nodes with 
nonzero weights in the regression. 

• In [Meinshausen and Biihlmann, 2006], the authors used linear regression 
with L\ regularization to estimate the neighbors of each node in a Gaussian 
graphical model: 

<? iiA = argmin -\\ Xi - T x||! + A||0||i 

The authors discussed in detail the choice of regularizer weight A, for which 
the cross-validation choice is not the best under certain circumstances. For 
the solution, the authors proposed an optimal choice of A under certain 
assumptions with full proof. 

• In [Fan, 2006], the authors proposed to learn GGM from directed graphical 
models using modified Lasso regression, which seems a promising method. 
The algorithm is listed here in detail. 

Given a GGM with variables x = [x\, ■ ■ ■ , x p ] T and the multivariate Gaus- 
sian distribution with covariance matrix E: 

p(x) =(2^vW^ exp H xTrt 

This joint probability can always be decomposed into the product of mul- 
tiple conditional probabilities: 

p 

P(x) = Y[P(xi\x i+h ..., p ) 
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Since the joint probability in the GGM is a multivariate Gaussian dis- 
tribution, each conditional probability also follows Gaussian distribution. 
This implies that for any GGM there is at least one DAG with the same 
joint distribution. 

Suppose that for a DAG there is a specific ordering of variables as 1, 2, • • • ,p. 
Each variable Xi only has parents with indices larger than i. Let (3 de- 
note the regression coefficients and D denote the data. The posterior 
probability given the DAG parameter j3 is 

p 

P(D\f3)=l[P(x l \ X{t+1):p ,f3) 

i=i 

Suppose linear regression Xi — Y^=i+i fijiXj + ei where the error ej follows 
normal distribution e, <~ N(0,ipi), then 

x = Lx + e 
e ~ JV p (0,*) 

Where T is an upper triangular matrix, r^- = i < j, e = (ei, • • • , e p ) T 
and ^ — diag(ip-L, • • • , ip p ). Thus 

X = {i-ryh 

So x follows a joint multivariate Gaussian distribution with covariance 
matrix and precision matrix as: 

e = (j-r)- 1 *((j-r)- 1 ) T 
ft = (/-r) T *- 1 (/-r) 

Wishart prior is assigned to the precision matrix ft such that ft ~ W P (S, T) 
with 5 degrees of freedom and diagonal scale matrix T = diag{6\, • • • , 6 P ). 
Each 9i is a positive hyper prior and satisfies 

Let fii — • • • ,/3 p i) T , and Tj represents the sub-matrix of T corre- 

sponding to variables X( i+1 ) :p . Then the associated prior for is P((3i\i/ji, 
N p -i(0,Tiipi) [Geiger and Heckcrman, 2002], thus: 

P{fi3i\il>i,6 j ) = N(Q,6j1> i ) 

And the associated prior for ^ is 

P( Vlw . r El) 
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where r(-) is the Gamma distribution. Like in [Figueiredo and Jain, 2001], 
the hyper prior 9 can be integrated out from prior distribution of ftj and 
thus 





1 ( A 

exp 



Suppose there are K samples in the data D and Xki is the ith variable in 
the fcth sample, then 



cx exp 



and 



P(^\e i ,/3 i ,D) = T 



' 5+p-i + K (9j 1 + Efc(^ - £j=i+i fti&Jw) 2 



The MAP estimation of ft is: 



ft = argmin^ \x M - ^ Pji x kj + V^h 

fc \ j'=*+i / j'=*+i 

ft is the solution of a Lasso regression. 

The authors further proposed a Feature Vector Machine (FVM) which is 
an advance to the the generalized Lasso regression (GLR) [Roth, 2004] 
which incorporates kernels, to learn the structure of undirected graphical 
models. The optimization problem is: 



arg min ]- ^ j3 p (3 q K(f p , f g ) 



S.t. 



where K(fi, fj) — </>(/i) T 4>{fj) is the kernel function, </>(•) is the mapping, 
either linear or non-linear, from original space to a higher dimensional 
space; fk is the fc-th feature vector, and y is the response vector from the 
training dataset. 
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4.2.3 System-identification Objective 

Algorithms in this category [Arkin et al., 1998, Gardner et al., 2003, Glass and 
Kauffman, 1973, Gustafsson et al., 2003, McAdams and Arkin, 1998] get ideas 
from network identification by multiple regression (NIR) , which is derived from 
a branch of engineering called system identification [Ljung, 1999], in which a 
model of the connections and functional relations between elements in a network 
is inferred from measurements of system dynamics. The whole system is mod- 
eled using a differential equation, then regression algorithms are used to fit this 
equation. This approach has been used to identify gene regulatory networks. 
Here the key idea of this type of approaches is illustrated by using the algorithm 
in [Gustafsson et al., 2003] as an example. 

Near a steady-state point (e.g., when gene expression does not change sub- 
stantially over time) , the nonlinear system of the genes may be approximated 
to the first order by a linear differential equation as: 

dx* n 

3 = 1 

where x\ is the expression of gene i at time t. The network of the interaction 
can be inferred by minimizing the residual sum of squares with constraints on 
the coefficients: 

argmin [ S - ^ ) 

n 

S.t. ^ \ w ij\ < Mi 

Note that this is essentially a Lasso regression problem since the constraints 
added to the Lagrangian is equivalent to L\ rcgularizcrs. Finally the adjancency 
matrix A of the network is identified from the coefficients by 

A l= l° [iw ^ = ° 
ll otherwise 

One problem with this approach is when the number of samples is less than the 
number of variables, the linear equation is undetermined. To solve this prob- 
lem, D'haeseleer et al. [1999] use non-linear interpolation to generate more data 
points to make the equation determined; Yeung et al. [2002] use singular value 
decomposition (SVD) to first decompose the training data, and then constrain 
the interaction matrix by exploring the sparseness of the interactions. 

4.2.4 Precision Matrix Objective 

In [Banerjee et al., 2006], the authors proposed a convex optimization algorithm 
for fitting sparse Gaussian Graphical Model from precision matrix. Given a 
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large-scale empirical dense covariance matrix S of multivariate Gaussian data, 
the objective is to find a sparse approximation of the precision matrix. Assuming 
X is the estimate of the precision matrix (the inverse of the variance matrix). 
The optimization of the penalized maximum likelihood (ML) is: 

max iogdet(X)-Tr(SX)-p\\X\\ 1 
The problem can be efficiently solved by Nesterovs method [Nesterov, 2005] . 
4.2.5 MDL Objective 

Methods in this category encode the parameters into the Minimum Description 
Length (MDL) criterion, and tries to minimize the MDL with respect to the 
regularization or constraints. 

• In [Schmidt and Murphy, 2007] , the authors proposed a structure learning 
approach which uses the L\ penalized regression with MDL as loss function 
to find the parents/neighbors for each node, and then apply the score- 
based search. The first step is the Li regularized variable selection to find 
the parents/neighbors set of a node by solving the following equation: 

9^{U) = argmin NLL{j,U,0) + A||0||i (4.5) 
e 

where A is the regularization parameter for the L\ norm of the parameter 
vector. NLL(j 7 U, 9) is the negative log-likelihood of node j with parents 
7r(j) and parameters 9: 

d \r\mle\ 

MDL(G) = ]T NLL(j, ttj , §f e + ^ log n (4.6) 

N 

NLL(j,n(j),6) = -J2^gP(X ij \X iMj) ,6) (4.7) 

»=1 

where N is the number of samples in the dataset. 

The L\ regularization will generate a sparse solution with many parame- 
ters being zero. The set of variables with non-zero values are set as the 
parents of each node. This hybrid structure learning algorithm is further 
discussed in Section 5.1. 

In general, this regression method is the same as the likelihood objected 
approaches, since the term of the description length of model in Equa- 
tion 4.7 is incorporated into the regularization term in Equation 4.5. 

• In [Guo and Schuurmans, 2006], the authors proposed an interesting struc- 
ture learning algorithm for Bayesian Networks, which incorporates pa- 
rameter estimation, feature selection and variable ordering into one single 
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convex optimization problem, which is essentially a constrained regression 
problem. The parameters of the Bayesian network and the feature selec- 
tor variables are encoded in the MDL objective function which is to be 
minimized. The topological properties of the Bayesian network (antisym- 
metricity, transitivity and reflexivity) are encoded as constraints to the 
optimization problem. 
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Hybrid 
Others 



Algorithms and 



5.1 Hybrid Algorithms 

Some algorithms perform the structure learning in a hybrid manner to utilize 
the advantages of constraint-based, score-based or regression-based algorithms. 
Here we list some of them. 

• Max-min Hill-climbing (MMHC): In [Tsamardinos et al., 2006], the au- 
thors proposed a Max-min Hill-climbing (MMHC) algorithm for structure 
learning of Bayesian networks. The MMHC algorithm shares the simi- 
lar idea as the Sparse Candidate Hill Climbing (SCHC) algorithm. The 
MMHC algorithm works in two steps. In the first step, the skeleton of the 
network is learned using a local discovery algorithm called Max-Min Par- 
ents and Children (MMPC) to identify the parents and children of each 
node through the conditional independency test, where the conditional 
sets are grown in a greedy way. In this process, the Max-Min heuristic 
is used to select the variables that maximize the minimum association 
with the target variable relative to the candidate parents and children. 
In the second step, the greedy hill-climbing search is performed within 
the constraint of the skeleton learned in the first step. Unlike the SCHC 
algorithm, MMHC docs not impose a maximum in-degree for each node. 

• In [Schmidt and Murphy, 2007], the authors proposed a structure learn- 
ing approach which uses the LI penalized regression to find the par- 
ents/neighbors for each node, and then apply the score-based search. The 
first step is the Li variable selection to find the parents/neighbors set of 
a node. The regression algorithm is discussed in Section 4.2.5. 

After the parent sets of all node are identified, a skeleton of the structure is 
created using the 'OR' strategy [Mcinshauscn and Biihlmann, 2006]. This 
procedure is called LIMB (Li-regularized Markov blanket). The LIMB is 
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plugged into structure search (MMHC) or ordering search [Teyssier and 
Roller, 2005]. In the application to MMHC, LIMB replaces the Sparse 
Candidate procedure to identify potential parents. In the application to 
ordering search in [Teyssier and Koller, 2005], given the ordering, LIMB 
replaces the SC and exhaustive search. 

5.2 Other Algorithms 

Besides the structure learning algorithms mentioned before, there are some other 
approaches. They are listed here. 

5.2.1 Clustering Approaches 

The simplest structure learning method is through clustering. First the similar- 
ities of any two variables are estimated, then any two variables with similarity 
higher than a threshold are connected by an edge [Lukashin et al., 2003]. Here 
the similarity may take different measures, including correlation [Eisen et al., 
1998, Spellman et al., 1998, Iyer et al., 1999, Alizadeh et al., 2000], Euclidean 
distance [Wen et al., 1998, Tamayo et al., 1999, Tavazoie et al., 1999] and mutual 
information [Butte and Kohane, 2000]. Using hierarchical clustering [Manning 
et al., 2008], the hierarchy structure may be presented at different scales. 

5.2.2 Boolean Models 

Some algorithms employed the Boolean models for structure learning in gene 
regulatory network reconstruction [Thomas, 1973, Akutsu et al., 1999, Liang 
et al., 1998]. These approaches assume the boolean relationship between regu- 
lators and regulated genes, and tried to identify appropriate logic relationship 
among gene based on the observed gene expression profiles. 

5.2.3 Information Theoretic Based Approach 

In [Basso et al., 2005], the authors employ the information theoretic approaches 
for reconstructing gene regulatory network. It first identify pairs of correlated 
genes based on the measurement of mutual information. It then eliminates 
indirect interaction among genes by applying the well-known staple of data 
transimission theory, the "data processing inequality" (DPI). There are two 
things unclear about this approaches: 1) Since the results could be sensitive 
to the order of elimination, it is important to provide a justification about the 
order of edges to be eliminated. 2) Most of the discussion within the paper is 
limited to loops with three edges. It is important to know how to address the 
cycles with more than three genes. 



39 



5.2.4 Matrix Factorization Based Approach 

Methods in this category use matrix factorization techniques to identify the 
interactions between variables. The matrix factorization algorithms used in- 
cludes singular value decomposition [Alter et al., 2000, D'haeseleer et al., 1999, 
Raychaudhuri et al., 2000], max-margin matrix factorization [DeCoste, 2006, 
Rennie and Srebro, 2005, Srebro et al., 2005] and non- negative matrix factor- 
ization [Badea and Tilivea, 2005, Paatero and Tapper, 1994, Hoyer and Dayan, 
2004, Lee and Seung, 2001, Shahnaz et al., 2006, Weinberger et al., 2005], net- 
work component analysis (NCA) [Liao et al., 2003]. Readers are referred to read 
a technical report [Jin and Zhou, 2006] for more details of this method. 
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